Abstract. It is well known that nonlinear approximation has an advantage
over linear schemes in the sense that it provides approximation rates
comparable to those of the linear schemes, but for a larger class of
approximands. This was established for spline approximations and for wavelet
approximations, and more recently by DeVore and Ron [2] for homogeneous
radial basis function (surface spline) approximations. However, no such
results are known for the Gaussian function, the preferred kernel in machine
learning and several engineering problems. In this paper we introduce and
analyze a new algorithm for approximating functions using translates of
Gaussian functions with varying tension parameters. At heart it employs the
strategy for nonlinear approximation of DeVore and Ron, but it selects kernels
by a method that is not straightforward. The crux of the difficulty lies in
the necessity to vary the tension parameter in the Gaussian function spatially
according to local information about the approximand: error analysis of
Gaussian approximation schemes with varying tension is, by and large, an
elusive target for approximators. We show that our algorithm is suitably
optimal in the sense that it provides approximation rates similar to other
established nonlinear methodologies like spline and wavelet approximations.
As expected and desired, the approximation rates can be as high as needed and
are essentially saturated only by the smoothness of the approximand.



1. Introduction 

1.1. Nonlinear Radial Basis Function Approximation. In this article we
consider $N$-term approximation by Gaussian networks, an approximation
technique widely used in statistics and engineering. This is an example of
nonlinear approximation since we select $d$-variate functions residing in

$$\mathcal{G}_N := \Big\{ \sum_{j=1}^{N} A_j \exp\Big(-\Big|\frac{\cdot - c_j}{\sigma_j}\Big|^2\Big) : c_j \in \mathbb{R}^d,\ \sigma_j > 0 \Big\},$$

which (failing to be closed under addition) is not a linear space. This stands
in contrast to the linear approximation problem, often studied in radial basis
function (RBF) theory, where the centers $(c_j)_j$ are predetermined and
approximants are chosen from a linear space

$$\operatorname*{span}_{1 \le j \le N} \phi_j(\cdot - c_j) = \operatorname*{span}_{1 \le j \le N} \exp\Big(-\Big|\frac{\cdot - c_j}{\sigma_j}\Big|^2\Big)$$

that depends on the set of centers.

Heuristically, the benefit of the nonlinear approach is that by placing centers
strategically, one may overcome defects, like discontinuities, cusps or other
local deficiencies in smoothness, of the target function $f$. Because such
defects may be manifested in a variety of ways, over regions or on lower
dimensional manifolds, and may occur at different scales, finding a precise
strategy is not at all straightforward. In this article, we present a method
for placing centers in a way that is suitable for creating effective nonlinear
approximants.

An important distinction between the nonlinear and linear problems is in how
convergence is measured. In the linear setting, the main approximation
parameter measures density of the centers, usually by means of the "fill
distance" $h = \max_{x \in \Omega} \operatorname{dist}(x, (c_j)_j)$; the
underlying approximation problem is to measure the rate of convergence as $h$
shrinks. In high dimensions, the assumption that centers fill a (high
dimensional) region $\Omega$ with a small fill distance is computationally
impractical. In nonlinear approximation the rate of convergence is measured
against the parameter $N$, the cardinality of the set of centers. This approach
lends itself to more frugal approximation in high dimensions.
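To make the contrast between $h$ and $N$ concrete, the following is a minimal sketch (not part of the method of this article) that estimates the fill distance of a uniform grid of centers in $[0,1]^d$. The function names `fill_distance` and `uniform_grid` are hypothetical, and the maximum over $\Omega$ is only approximated on a finite sample of points.

```python
import numpy as np

def fill_distance(centers, omega_samples):
    """Approximate h = max_{x in Omega} dist(x, centers) on a finite sample of Omega."""
    centers = np.asarray(centers, dtype=float)
    omega_samples = np.asarray(omega_samples, dtype=float)
    # Pairwise Euclidean distances, shape (num_samples, num_centers).
    diffs = omega_samples[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Distance from each sample to its nearest center, then the worst case.
    return dists.min(axis=1).max()

def uniform_grid(n, d):
    """All points of [0,1]^d with coordinates in an n-point uniform grid."""
    axes = [np.linspace(0.0, 1.0, n)] * d
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, d)

# Same spacing per axis, but the number of centers N = 5**d grows with d.
for d in (1, 2, 3):
    centers = uniform_grid(5, d)
    h = fill_distance(centers, uniform_grid(9, d))
    print(f"d={d}: N={len(centers)}, estimated fill distance h={h:.4f}")
```

With a fixed number of grid points per axis the number of centers grows exponentially in $d$ while the fill distance does not improve, which is the impracticality referred to above.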

The approximation scheme we introduce selects $s_{f,N}$ from $\mathcal{G}_N$,
and is shown to have convergence rate $\|f - s_{f,N}\|_p = O(N^{-s/d})$ for
target functions $f$ having $L_\tau$ smoothness $s$, with
$\frac{1}{\tau} = \frac{1}{p} + \frac{s}{d}$. Generally speaking, such nonlinear
estimates are sharp in the sense that they are similar to known results for
nonlinear wavelet approximation, and one cannot expect to achieve a similar
rate $\operatorname{dist}_p(f, \mathcal{G}_N) = O(N^{-s/d})$ by decreasing
either $\tau$ or the underlying smoothness.

To provide a more robust space of approximants, we permit the tension (also
known as shape or dilation) parameters $\sigma_j$ to respond to the nonuniform
distribution of the centers. The question of how to tune a tension parameter is of active interest to the
Learning Theory community, jTOl [TT] , as well as the RBF community [JJ [T] , but in 
most theoretical works, the tension parameter is taken to be constant for all centers. 
Although the spatially varying tension parameter is a natural idea, and is used in 
practice [HJ [9] , it has heretofore not been considered seriously in an approximation 
theoretic sense. Although it may be tempting to use tight dilations when the
centers are dense, essentially setting $\sigma_j$ proportional to a local
spacing of centers around $c_j$, the manner in which our scheme sets the
tension is more complicated, but one that is ultimately justified by the error
estimates we provide. In any case, we note that there is some empirical
evidence [4, Section 3] that Gaussian approximation is unstable without
adjusting the tension.

Nonlinear approximation with RBFs has not been investigated with the same
intensity as other basic elements of approximation theory (splines, wavelets,
etc.). Recently DeVore and Ron [2] (employing an idea on which we have modeled
our method) have made a first foray into nonlinear RBF approximation using RBFs
that are fundamental solutions of elementary, homogeneous, elliptic PDEs. Such
RBFs, which include the "surface splines," allow simple but elegant
approximation schemes that are not burdened by the requirement that the target
function must reside in the native space. In addition, the homogeneity of these
RBFs means that the $N$-term approximation spaces are, essentially, invariant
under rescaling and, thus, there is no need to select dilations $\sigma_j$:
this is done automatically. However, many prominent RBFs, including the
Gaussians, do not fall into this category. For the kernels considered by DeVore
and Ron, the approximation order is saturated, meaning that for this method
there is an upper bound on the rate of convergence: by increasing smoothness
beyond a saturation level $k$ (determined by the order of the elliptic
differential operator inverted by the kernel) there is no corresponding
increase in the rate of decay of the error. This is not so with Gaussian
kernels. Furthermore, the kernels used by DeVore and Ron are dependent on the
operator they invert, and, hence, (subtly) dependent on the spatial dimension.
This is a hindrance which the Gaussians also avoid.

1.2. The Methodology. As in [2], to construct the $N$-term approximant
$s_{f,N}$, we begin with a wavelet decomposition of the target function
$f = \sum_I f_I \psi_I$. Based on the size of the wavelet coefficient and the
smoothness norm of the target function, the fixed budget of $N$ terms is
distributed over the elements in the expansion into individual budgets $N_I$
(many of which are zero). Each wavelet $\psi_I$ is then approximated by a
linear combination $s_I$ of Gaussians that uses at most $N_I$ terms. The full
$N$-term approximant is then $s_{f,N} = \sum_I f_I s_I$. The main idea is that
we have a scheme for nonlinear approximation associated with this family of
wavelets that can be lifted to the Gaussians by means of approximating the
individual members of the family. Matters are simplified when we assume the
entire family to be generated from a few prototypes via dilation and
translation: our collection of Gaussians is invariant under these operations!
This reduces the problem of efficiently approximating all members of the
wavelet family to the problem of approximating a few fixed wavelets by linear
combinations of Gaussians.

The crucial issue is to approximate a basic function $\psi$ using a linear
combination of $N$ shifted Gaussians. We view the number $N$ as the portion we
are willing to invest in approximating $\psi$ out of our total budget of
centers. It is essential to understand how to apportion the budget, and this
can only be accomplished when we have good $N$-term error estimates. Thus, we
are interested in understanding how to approximate globally using only finitely
many centers. This is a very hard problem for the Gaussian. We completely
resolve this problem for a function $\psi$ that is band-limited, and in
addition, has rapid decay:

for every $k$ there is a constant $C_k$ such that $|\psi(x)| \le C_k (1 + |x|)^{-k}$.

The trick we employ is to create an approximant
$\sum_{\alpha \in h\mathbb{Z}^d} a(\alpha, h)\,\phi(\cdot - \alpha)$ that
converges rapidly (globally) to $\psi$ in the uniform norm, with coefficients
$a(\alpha, h)$ that are roughly the same size as $\psi(\alpha)$. Then we modify
this approximation scheme by throwing away centers from a region where $\psi$
is small. This is where the two assumptions on the wavelet (that $\psi$ is
bandlimited and that it is rapidly decaying) come into play. Bandlimiting means
that the "full" approximation scheme (using centers $h\mathbb{Z}^d$) has
coefficients $a(\alpha, h)$ that can be expressed as the convolution of $\psi$
with a Schwartz function. Rapid decay allows us to attribute polynomial decay
of arbitrary orders to the coefficients.

1.3. Organization. In Section 2 of this article, we develop the basic linear
approximation scheme at the heart of our approach. First considered is the
operator $T_h^{\sharp}$, which generates the 'full' approximant, an infinite
series of Gaussians having the grid $h\mathbb{Z}^d$ as the set of centers.
Second we develop the operator $T_h^{\flat}$, which generates






THOMAS HANGELBROEK AND AMOS RON 



the 'truncated' approximant, a linear combination of roughly $h^{-2d}$
Gaussians. At the end of Section 2 we generalize to treat scaled wavelets using
a fixed budget of $N$ centers. This is the role of the map $T_N$. Theorem 4
gives the error for wavelets at all dilation levels.

Section 3 treats nonlinear approximation in $L_p$ for $1 < p < \infty$. Results
match those obtained for surface splines in [2]. This involves a sophisticated
strategy for distributing centers, which is expressed in Section 3.1. The main
result is Theorem 9 in Section 3.2.

Section 4 treats nonlinear approximation in $L_\infty$, a case that was not
considered in [5]. For technical reasons, we consider approximation of
functions from Besov spaces in this section. The main result in that section is
Theorem 12.

1.4. Notation and Background. We denote the ball with center $c$ and radius $R$
by $B(c, R)$. The symbol $I \subset \mathbb{R}^d$ will represent a cube with
corner at $c(I) \in \mathbb{R}^d$ and sidelength $\ell(I) > 0$: it is the set
$c(I) + [0, \ell(I)]^d$. We denote the volume of a set $\Omega$ in
$\mathbb{R}^d$ by $|\Omega|$.

The natural affine change of variables associated with a cube $I$ is denoted
with the subscript $I$: i.e., for a function $g : \mathbb{R}^d \to \mathbb{C}$,

$$g_I(x) := g\Big(\frac{x - c(I)}{\ell(I)}\Big).$$

The symbol $C$, often with a subscript, will always represent a constant. The
subscript is used to indicate dependence on various parameters. The value of
$C$ may change, sometimes within the same line.

For Schwartz functions, the $d$-dimensional Fourier transform is given by the
formula $\hat f(\xi) = \int_{\mathbb{R}^d} f(x) e^{-i\langle x, \xi\rangle}\,dx$,
and its inverse is
$f(x) = (2\pi)^{-d} \int_{\mathbb{R}^d} \hat f(\xi) e^{i\langle x, \xi\rangle}\,d\xi$.
An important property of the Gaussian functions

$$\phi_\sigma : x \mapsto \exp[-|x/\sigma|^2], \qquad (1)$$

is that they satisfy $\hat\phi_\sigma = (\sigma\sqrt{\pi})^d \phi_{2/\sigma}$.
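The identity $\hat\phi_\sigma = (\sigma\sqrt{\pi})^d \phi_{2/\sigma}$ can be checked numerically. The sketch below does this in dimension $d = 1$ under the Fourier convention stated above; the helper names `phi` and `fourier_transform_1d`, the integration window and the sample values of $\sigma$ and $\xi$ are illustrative assumptions, and the integral is replaced by a Riemann sum on a wide grid.

```python
import numpy as np

def phi(x, sigma):
    """Gaussian phi_sigma(x) = exp(-|x/sigma|^2), as in (1), for d = 1."""
    return np.exp(-(x / sigma) ** 2)

def fourier_transform_1d(g, xi, half_width=40.0, n=200001):
    """Riemann-sum approximation of the integral of g(x) exp(-i x xi) over the real line."""
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    return np.sum(g(x) * np.exp(-1j * x * xi)) * dx

sigma = 1.5
for xi in (0.0, 0.7, 2.0):
    numeric = fourier_transform_1d(lambda x: phi(x, sigma), xi)
    closed_form = sigma * np.sqrt(np.pi) * phi(xi, 2.0 / sigma)
    print(f"xi={xi}: |numeric - closed form| = {abs(numeric - closed_form):.2e}")
```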

2. Shift-invariant Gaussian approximation of band-limited functions

2.1. Approximation using infinitely many centers. Let $B \subset \mathbb{R}^d$
be a fixed ball centered at the origin. We denote by

$$\mathcal{H}_B \qquad (2)$$

the space of all Schwartz functions whose Fourier transform is supported in
$B$. Let $\phi$ be the $d$-dimensional Gaussian function, dilated by a fixed
(arbitrary) dilation $\sigma > 0$ (cf. (1)). Given $h > 0$, consider the linear
space

$$S_h := S_h(\phi) := \operatorname{span}\{\phi(\cdot - \alpha) : \alpha \in h\mathbb{Z}^d\},$$

closed in the topology, say, of uniform convergence on compact sets.

We consider in this section approximation schemes and approximation errors for
functions in $\mathcal{H}_B$ from the space $S_h$. We adopt to this end the approximation schemes
of pQ, and show that in our setup these schemes provide superb approximations to 
the class $\mathcal{H}_B$: the error decays exponentially fast as the spacing
parameter $h$ tends to 0!

Let us fix now $f \in \mathcal{H}_B$, and $h > 0$. We denote by $f_\phi$ the
function whose Fourier transform is $\hat f/\hat\phi$. We note that $f_\phi$ is
in $\mathcal{H}_B$, since $f_\phi = f * \eta_\phi$ for a Schwartz function
$\eta_\phi$ (that depends only on $\phi$ and $B$) and $\mathcal{H}_B$ is an
ideal in the Schwartz space. We then approximate $f$ by $h^d T_h^{\sharp} f$,
with

$$T_h^{\sharp} f := \sum_{\alpha \in h\mathbb{Z}^d} f_\phi(\alpha)\,\phi(\cdot - \alpha). \qquad (3)$$

Our main result in this subsection is the following:

Proposition 1. Let $B = B(0, R)$ be the ball of radius $R$ centered at the
origin. The uniform error in approximating $f \in \mathcal{H}_B$ by
$h^d T_h^{\sharp} f$ as above satisfies, for $h < \pi/R$,

$$\|f - h^d T_h^{\sharp} f\|_\infty \le C \|\hat f\|_1 e^{-c/h^2}.$$

The constants $C$ and $c$ depend on $R$ and the dilation parameter $\sigma$
used in the definition of $\phi$, but are independent of $f$ and $h$.

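Proof. Using the fact that $\widehat{f_\phi} = \hat f/\hat\phi$, we write
$h^d T_h^{\sharp} f$ as

$$\int_{\mathbb{R}^d} \hat f(\theta)\, k_h(\theta, \cdot)\, d\theta,$$

with

$$k_h(\theta, z) := (2\pi)^{-d} \frac{h^d}{\hat\phi(\theta)} \sum_{\alpha \in h\mathbb{Z}^d} \phi(z - \alpha)\, e^{i\langle \alpha, \theta\rangle}.$$

Invoking the Poisson summation formula (which obviously is valid for the
Gaussian function), we obtain that

$$k_h(\theta, z) = (2\pi)^{-d} \frac{1}{\hat\phi(\theta)} \sum_{\beta \in 2\pi\mathbb{Z}^d/h} \hat\phi(\theta + \beta)\, e^{i\langle z, \theta + \beta\rangle}.$$

When applying the above kernel to $\hat f$, we are allowed to do the
integration term-by-term, with the $(\beta = 0)$-term yielding the original
function $f$. Therefore,

$$f(z) - h^d T_h^{\sharp} f(z) = -\int_{\mathbb{R}^d} \hat f(\theta)\, k'_h(\theta, z)\, d\theta,$$

with

$$k'_h(\theta, z) := (2\pi)^{-d} \frac{1}{\hat\phi(\theta)} \sum_{\beta \in 2\pi\mathbb{Z}^d/h \setminus \{0\}} \hat\phi(\theta + \beta)\, e^{i\langle z, \theta + \beta\rangle}.$$

Note that the kernel is integrated only over $\theta \in B$, since
$\operatorname{supp}(\hat f) \subset B$ by assumption. Thus, we obtain that

$$\|f - h^d T_h^{\sharp} f\|_\infty \le (2\pi)^{-d} \|\hat f\|_1 K_h,$$

with

$$K_h := \sum_{\beta \in 2\pi\mathbb{Z}^d/h \setminus \{0\}} \big\|\hat\phi(\cdot + \beta)/\hat\phi\big\|_{L_\infty(B)}.$$

Let $R$ denote the radius of $B$. If $2R < |\beta|$ then, for $\xi \in B$,
$|\xi + \beta|^2 - |\xi|^2 \ge (|\beta| - 2|\xi|)^2$. Consequently

$$K_h \le C \hat\phi(a),$$

for $a \le \operatorname{dist}(2B, 2\pi\mathbb{Z}^d/h \setminus \{0\}) = 2(\pi/h - R)$. □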
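The rapid decay of the error in Proposition 1 can be observed in a small experiment. The following is a numerical sketch in dimension $d = 1$, not the authors' implementation: the test function (whose Fourier transform is a smooth bump supported in $[-R, R]$), the values of $\sigma$, $R$ and $h$, the names `inverse_ft` and `full_approximant`, and the window used to cut off the infinite sum of shifts are all illustrative assumptions. The cut-off is harmless at the evaluation points because the discarded Gaussians are negligible there.

```python
import numpy as np

SIGMA, R = 1.0, 1.0                      # illustrative dilation and band limit
theta = np.linspace(-R, R, 2001)         # quadrature nodes covering supp(f_hat)
dtheta = theta[1] - theta[0]

# f_hat: a smooth bump supported in [-R, R] (it vanishes at the endpoints).
f_hat = np.zeros_like(theta)
inside = np.abs(theta) < R
f_hat[inside] = np.exp(-1.0 / (1.0 - (theta[inside] / R) ** 2))

# Fourier transform of phi(x) = exp(-|x/SIGMA|^2) for d = 1.
phi_hat = SIGMA * np.sqrt(np.pi) * np.exp(-(SIGMA * theta / 2.0) ** 2)

def inverse_ft(g_hat, x):
    """(2 pi)^{-1} times the integral of g_hat(theta) exp(i x theta), by a Riemann sum."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    kernel = np.exp(1j * np.outer(x, theta))
    return (kernel @ g_hat).real * dtheta / (2.0 * np.pi)

def full_approximant(z, h, cutoff=20.0):
    """h * sum over alpha in hZ of f_phi(alpha) phi(z - alpha), with |alpha| <= cutoff."""
    m = int(np.floor(cutoff / h))
    alpha = h * np.arange(-m, m + 1)
    coeffs = inverse_ft(f_hat / phi_hat, alpha)          # samples of f_phi
    gauss = np.exp(-((z[:, None] - alpha[None, :]) / SIGMA) ** 2)
    return h * (gauss @ coeffs)

z = np.linspace(-5.0, 5.0, 201)
f_exact = inverse_ft(f_hat, z)
errors = {h: float(np.max(np.abs(f_exact - full_approximant(z, h))))
          for h in (1.0, 0.75, 0.5)}
for h, err in errors.items():
    print(f"h={h}: max error on [-5, 5] = {err:.2e}")
```

Halving $h$ in this setup should reduce the observed error by many orders of magnitude, until rounding error dominates.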
2.2. Approximation using finitely many centers. In this subsection, we modify
the approximant of the previous subsection and use only a finite number of
centers. This is a necessary step for us, since our budget of centers is
finite. Our approximand is still a function $f \in \mathcal{H}_B$.

Our setup is as follows. Given $f$ and a mesh-scaling parameter $h$, we will
approximate $f$ by $h^d T_h^{\flat} f$, with

$$T_h^{\flat} f := \sum_{\alpha \in h\mathbb{Z}^d \cap B_h} f_\phi(\alpha)\,\phi(\cdot - \alpha), \qquad (4)$$

with $f_\phi$ and $\phi$ as in the previous subsection, and $B_h$ a ball of
radius $1/h$. The crux here is the correspondence between the mesh size $h$,
and the radius $1/h$ of the domain of the shifts we "preserve":
$T_h^{\flat} f$ is obtained from $T_h^{\sharp} f$ by removing from the sum all
shifts outside a ball of radius $1/h$. Note that the number of shifts
$N := N(h)$ that are being used for a given $h$ satisfies

$$N \sim h^{-2d}$$

with constants of equivalence depending on $d$ only. At the end, we need to
control the error in terms of the parameter $N$. For the time being, we still
write the error in terms of the mesh size $h$.
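The relation $N \sim h^{-2d}$ is just a count of lattice points: $\alpha = hk$ with $k \in \mathbb{Z}^d$ lies in the ball of radius $1/h$ exactly when $|k| \le h^{-2}$. The sketch below (with the hypothetical helper `count_centers`, and with small floating-point tolerances that are an implementation choice) counts these points and compares with $h^{-2d}$.

```python
import math
import numpy as np

def count_centers(h, d):
    """Number of points of h*Z^d in the closed ball of radius 1/h centered at 0."""
    radius = 1.0 / h**2                   # |h k| <= 1/h  <=>  |k| <= 1/h^2
    m = int(math.floor(radius + 1e-9))    # small tolerance against rounding
    ks = np.arange(-m, m + 1, dtype=float)
    grids = np.meshgrid(*([ks] * d), indexing="ij")
    sq_norm = sum(g**2 for g in grids)    # |k|^2 on the integer grid
    return int(np.count_nonzero(sq_norm <= radius**2 * (1.0 + 1e-12)))

for h in (0.5, 0.25, 0.1):
    n = count_centers(h, 2)
    print(f"h={h}: N={n}, N * h^(2d) = {n * h**4:.4f}")
```

For $d = 2$ the ratio $N h^{2d}$ approaches the area $\pi$ of the unit disc as $h \to 0$, in line with the stated equivalence.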

Once the approximation operator uses the above truncated sum, one cannot expect
the error to decay exponentially fast as in Proposition 1. However, the new
error, measured in the uniform norm, still decays rapidly.¹

Lemma 2. Let $k > 0$, and $f \in \mathcal{H}_B$. Then there exists
$C_{f,k} > 0$ that does not depend on $h$ such that for all small enough $h$

$$\|f - h^d T_h^{\flat} f\|_\infty \le C_{f,k} h^k. \qquad (5)$$

Proof. Thanks to Proposition 1 we only need to show that

$$h^d \|T_h^{\sharp} f - T_h^{\flat} f\|_\infty \le C h^k.$$

However, the norm $\|T_h^{\sharp} f - T_h^{\flat} f\|_\infty$ is bounded above
by the sum

$$\sum_{|\alpha| > 1/h,\ \alpha \in h\mathbb{Z}^d} |f_\phi(\alpha)|.$$

Since $f_\phi$ decays rapidly at $\infty$, the above sum is $O(h^k)$ for any
fixed $k$, and our claim follows. □

The uniform error bound that we just obtained is not refined enough for our
purposes. We will need better estimates for the error away from the origin,
i.e., outside the ball $B_h$ of radius $1/h$. Indeed, such estimates are valid,
but require a different argument:

Lemma 3. Let $k > 0$, and $f \in \mathcal{H}_B$. Then there is a constant
$C'_k > 0$ (depending on $k$, $d$ and $f$ but independent of $h$), so that the
function $T_h^{\flat} f$ from Lemma 2 approximates $f$ with pointwise error:

$$|(f - h^d T_h^{\flat} f)(x)| \le C'_k h^k (1 + |x|)^{-k}. \qquad (6)$$

¹We could have made the dependence of $C_{f,k}$ below on $f$ more explicit.
However, this is not needed for our subsequent applications.
Proof. If $|x| < 2/h$, then

$$(1 + |x|)^{-k} \ge (h/5)^k,$$

hence the requirement here follows from the inequality in Lemma 2 when $k$
there is replaced by $2k$.

For the case $|x| \ge 2/h$, we may prove that

$$|(f - h^d T_h^{\flat} f)(x)| \le C'_k (1 + |x|)^{-2k},$$

since

$$h^k (1 + |x|)^{-k} \ge C |x/2|^{-2k}.$$

To this end, we estimate the difference

$$f(x) - h^d T_h^{\flat} f(x)$$

directly. First, $f$ decays rapidly, by assumption, hence certainly satisfies
the required estimate. As to $h^d T_h^{\flat} f$, we note that, since $f_\phi$
decays rapidly, the sum

$$h^d \sum_{\alpha \in h\mathbb{Z}^d} |f_\phi(\alpha)|$$

is bounded, and the bound can be made independent of $h$ (the bound is,
essentially, the $L_1$-norm of $f_\phi$). Thus, we can bound
$h^d T_h^{\flat} f(x)$, up to an $h$-independent constant, by

$$\max\{\phi(x - \alpha) : |\alpha| \le 1/h\}.$$

Since $|x| \ge 2/h$, $|x - \alpha| \ge |x|/2$, hence

$$|h^d T_h^{\flat} f(x)| \le C'_k \phi(x/2).$$

Thus we are left to show that

$$\phi(x) \le |x|^{-2k}, \quad \text{for } |x| \ge 1/h,$$

for small enough $h$, which is clearly valid due to the exponential decay of
$\phi$ at $\infty$. □

2.3. Gaussian approximation of a wavelet system. We now assume that we have in
hand a finite collection $\Psi \subset \mathcal{H}_B$, with $\mathcal{H}_B$ as
in the previous section. Then, Lemma 3 holds for each $f := \psi \in \Psi$.
Considering $\Psi$ as the set of mother wavelets in a suitable wavelet system,
we need also to develop suitable approximation schemes for shifted dilations of
$\psi$, i.e., we need approximation schemes and error bounds for functions of
the form

$$\psi((\cdot - c)/\ell), \quad \psi \in \Psi,\ c \in \mathbb{R}^d,\ \ell > 0.$$

However, such schemes are trivial: since we are allowed to use shifted-dilated
versions of our original Gaussian $\phi$, we may simply use the approximation

$$\psi((\cdot - c)/\ell) \approx (h^d T_h^{\flat} \psi)((\cdot - c)/\ell).$$

Note that $T_h^{\flat} \psi$ employs $N \sim h^{-2d}$ centers. Fixing $N$
momentarily, we define a new map, $T_N$, that is defined on all dilated shifts
of each $\psi \in \Psi$ by

$$\big(T_N \psi((\cdot - c)/\ell)\big) := N^{-1/2} \big(T^{\flat}_{N^{-1/(2d)}} \psi\big)((\cdot - c)/\ell). \qquad (7)$$

The error bounds of the previous section apply directly here. We just need to
replace each occurrence of $h$ by $N^{-1/(2d)}$. Thus, we obtain:
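Theorem 4. Let $\Psi \subset \mathcal{H}_B$ be given and finite. Let $k > 0$,
and let $I$ be a cube. Then, there exists a constant $C$ independent of $N$ and
$I$ such that, for every $N$ sufficiently large, and for every $\psi \in \Psi$
and $I$ as above,

$$|\psi_I(x) - T_N \psi_I(x)| \le C N^{-k/d} \Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-2k}, \qquad x \in \mathbb{R}^d.$$
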
3. Nonlinear Approximation in $L_p$, $1 < p < \infty$

In the previous section, we derived error estimates for the approximation of
each member of a bandlimited smooth wavelet system by suitably chosen $N$
shifted-dilated Gaussians. Armed with these error estimates, we finally tackle
in this section our central problem: approximating a general function by
finitely many shifted-dilated Gaussians. Our approach follows [2] and is
similarly based on approximating the wavelets in the wavelet expansion of the
actual approximand. To this end, we choose first any, say orthogonal, wavelet
system whose mother wavelets are all bandlimited Schwartz functions. We define
below MRA systems and wavelets in the exact way that fits our needs. Let us
stress that the actual definitions of wavelet systems are far more flexible.

Definition 5 (Wavelets). In this article a univariate wavelet system is an
orthonormal MRA wavelet system whose generators are bandlimited Schwartz
functions: a scaling function $\eta_0$ and a (mother) wavelet $\eta_1$, both
bandlimited Schwartz functions. See [51 3.2] or [5] for a possible
construction. Multivariate wavelet systems are tensor products of a univariate
one, hence its wavelets are indexed by $(I, e)$, an ordered pair consisting of
a dyadic cube, $I$, and a gender $e \in E := \{0,1\}^d \setminus \{0\}$,
corresponding to one of the (non-origin) corners of the unit cube $[0,1]^d$.

Let $\mathcal{D}$ be the collection of all dyadic cubes, viz., with $I_0$ the
unit cube,

$$\mathcal{D} := \{2^j(k + I_0) : j \in \mathbb{Z},\ k \in \mathbb{Z}^d\}.$$

We denote by $\mathcal{D}_j$ the subset of dyadic cubes with common edgelength
$2^j$.

The wavelet $\psi_{I,e}$ is an affine change of variable (as in Section 1.4) of
the mother wavelet $\psi_e = \psi_{I_0,e}$ for some $e \in E$. Since we use
more than one mother wavelet (indeed, we use $\#E = 2^d - 1$), we regard
$\mathcal{D}$ and $\mathcal{D}_j$ as multisets and we suppress dependence on
the gender $e$. Thus, the notation $\psi_I$ stands for the $I$-version of any
of the mother wavelets, and a summation over $\mathcal{D}$ or over one of its
subsets, unless otherwise noted, is assumed to take place over $E$ as well.
This does not cause any confusion, since in this section our algorithms and
their analysis do not pay attention to the details of the actual mother wavelet
that is employed.

Our problem is then the following basic one. We are given a smooth function $f$
(from some smoothness class, see below) and a budget of $N$ centers. We are
then allowed to approximate $f$ by a total of $N$ shifted-dilated Gaussians. We
carry out this approximation by distributing the centers across the wavelet
system: for each $I \in \mathcal{D}$, we allocate $N_I$ centers as "the
$I$-budget" and use these budgeted centers for approximating the term
$f_I \psi_I$ in the wavelet expansion

$$f = \sum_{I \in \mathcal{D}} f_I \psi_I. \qquad (8)$$

The individual error when approximating $f_I \psi_I$ by $N_I$ Gaussians was the
subject of the previous section. Thus, our analysis here will focus on the
estimation of the cumulative error. But, first and foremost, we need to devise
an algorithm for distributing the budget across the different wavelets. We
refer to this algorithm as the cost distribution.

3.1. Triebel-Lizorkin Cost Distribution. It is convenient to associate each
wavelet with cost $c_I > 0$ that is not necessarily an integer, and then to
determine $N_I$ from the formula

$$N_I := \begin{cases} \lfloor c_I \rfloor, & \lfloor c_I \rfloor \ge N_0, \\ 0, & \text{otherwise}, \end{cases}$$

where $N_0$ is some fixed integer that depends on the wavelet system and on
nothing else.

We now discuss the cost distribution $c_I$, which depends on several factors.
In addition to the volume of the dyadic cube, $|I|$, it depends on the wavelet
coefficient $f_I$, the smoothness norm of $f$ (defined below), and an estimate
of the size of a partial reconstruction of $f$. To this end, we make the
following definitions:

Definition 6. Given $s, q > 0$, we define the maximal function $M_{s,q} f$ as

$$M_{s,q} f(x) := \Big( \sum_{I \in \mathcal{D}} |I|^{-sq/d} |f_I|^q \chi_I(x) \Big)^{1/q}. \qquad (9)$$

For a dyadic cube $I$, we define a partial maximal function by

$$M_{s,q,I} f(x) := \Big( \sum_{I \subset I' \in \mathcal{D}} |I'|^{-sq/d} |f_{I'}|^q \chi_{I'}(x) \Big)^{1/q}. \qquad (10)$$

Given now $\tau, s, q > 0$, we define the Triebel-Lizorkin space
$F^s_{\tau,q}$ via the finiteness of the following quasi-seminorm:

$$|f|_{F^s_{\tau,q}} := \|M_{s,q} f\|_\tau. \qquad (11)$$

We note that for any cube $I$, the partial maximal function $M_{s,q,I} f$ is
nonnegative and always $\le M_{s,q} f$. Furthermore, it achieves its maximum on
the cube $I$, where it is constant. Thus the number
$m_{s,q,I} := M_{s,q,I} f(x)$, $x \in I$, is well-defined, and
$m_{s,q,I} = \sup_{y \in \mathbb{R}^d} M_{s,q,I} f(y) \le M_{s,q} f(x)$,
$x \in I$. In the definition below, $s$ stands for the smoothness of the
function we approximate, and $p$ for the norm in which we measure the error.

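Definition 7 (Cost Distribution). Let $s > 0$, and $p > 1$. Define $\tau, q$ by
$1/\tau := 1/p + s/d$ and $1/q := 1 + s/d$. Let $f \in F^s_{\tau,q}$, with
wavelet expansion (8). We choose then the cost of a dyadic cube
$I \in \mathcal{D}$ as

$$c_I := |f|_{F^s_{\tau,q}}^{-\tau}\, m_{s,q,I}^{\tau - q}\, |f_I|^q\, |I|^q\, N. \qquad (12)$$

Let us first verify that the sum of all the costs does not exceed our budget
$N$. Since $1/q = 1 + s/d$ gives $q = 1 - qs/d$, we have

$$\sum_{I \in \mathcal{D}} c_I = \sum_{I \in \mathcal{D}} |f|_{F^s_{\tau,q}}^{-\tau}\, m_{s,q,I}^{\tau - q}\, |f_I|^q\, |I|^{1 - qs/d}\, N.$$

Since $|I| = \int_{\mathbb{R}^d} \chi_I(x)\,dx$, we can write the right hand
side as an integral, namely as
$|f|_{F^s_{\tau,q}}^{-\tau} N \int_{\mathbb{R}^d} \sum_I m_{s,q,I}^{\tau - q} |f_I|^q |I|^{-qs/d} \chi_I(x)\,dx$.
Invoking the fact that, for $x \in I$, $m_{s,q,I} \le M_{s,q} f(x)$ (and that
$\tau > q$), gives

$$\sum_{I \in \mathcal{D}} c_I \le |f|_{F^s_{\tau,q}}^{-\tau} N \int_{\mathbb{R}^d} (M_{s,q} f(x))^{\tau}\,dx = N.$$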
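The cost distribution (12) and the budget check above can be illustrated in a toy setting. The sketch below is a hedged illustration only: it assumes $d = 1$, restricts to the dyadic intervals $[k2^{-j}, (k+1)2^{-j})$ inside $[0,1)$ for finitely many levels $j = 0, \dots, J$ (so cube sizes are $2^{-j}$), and uses randomly generated stand-in coefficients. The names `tl_costs` and `budgets` and the value of $N_0$ are hypothetical.

```python
import numpy as np

def tl_costs(coeffs, s, p, N):
    """Costs c_I from (12) for d = 1; coeffs[j][k] is f_I for I = [k 2^-j, (k+1) 2^-j)."""
    d = 1
    q = 1.0 / (1.0 + s / d)
    tau = 1.0 / (1.0 / p + s / d)
    abs_levels = [np.abs(np.asarray(level, dtype=float)) for level in coeffs]
    # partial[j][k] = m_{s,q,I}^q: the sum of |I'|^{-sq/d} |f_I'|^q over cubes I' containing I.
    partial = []
    for j, a in enumerate(abs_levels):
        z = (2.0 ** (-j)) ** (-s * q / d) * a ** q
        if j > 0:
            z = z + np.repeat(partial[-1], 2)    # each parent has two children
        partial.append(z)
    # On the finest cells the partial sums equal (M_{s,q} f)^q, so this is |f|_F^tau.
    J = len(coeffs) - 1
    norm_tau = np.sum(partial[-1] ** (tau / q)) * 2.0 ** (-J)
    return [N * partial[j] ** ((tau - q) / q) * a ** q * (2.0 ** (-j)) ** q / norm_tau
            for j, a in enumerate(abs_levels)]

def budgets(costs, N0):
    """N_I = floor(c_I) if floor(c_I) >= N0, and 0 otherwise."""
    out = []
    for c in costs:
        n = np.floor(c).astype(int)
        out.append(np.where(n >= N0, n, 0))
    return out

rng = np.random.default_rng(0)
coeffs = [rng.standard_normal(2 ** j) * 2.0 ** (-1.5 * j) for j in range(6)]
N = 1000
costs = tl_costs(coeffs, s=1.0, p=2.0, N=N)
total_cost = sum(float(c.sum()) for c in costs)
total_budget = sum(int(n.sum()) for n in budgets(costs, N0=3))
print(f"sum of costs = {total_cost:.2f} <= N = {N}; centers used = {total_budget}")
```

The printed total of the costs stays below $N$, as the integral argument predicts, and rounding down can only reduce the number of centers further.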



3.2. Approximating the Wavelet Expansion. Once a budget of $N_I$ centers is
allocated for the approximation of the term $f_I \psi_I$ in the wavelet
expansion of $f$, we appeal to Theorem 4 to conclude that the term can be
approximated by $N_I$ Gaussians with error that is bounded (up to a constant
that depends only on the wavelet system and on the parameter $k$) by
$|f_I| R_I$, where

$$R_I(x) := C_{k,d} \min(1, N_I^{-k/d}) \Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-2k} \le C'_{k,d} \min(1, c_I^{-k/d}) \Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-2k}. \qquad (13)$$

The following lemma, which is proved in the next subsection, simplifies the
above error:

Lemma 8. Let $1 < p < \infty$, then

$$\Big\| \sum_{I \in \mathcal{D}} |f_I| R_I \Big\|_p \le C_{k,d} \Big\| \sum_{I \in \mathcal{D}} \min(1, c_I^{-k/d}) |f_I| \chi_I \Big\|_p.$$

We are ready to state and prove our main result concerning the case
$1 < p < \infty$.

Theorem 9. Given $s > 0$ and $1 < p < \infty$, there is a constant $C_{p,s,d}$
so that for $f \in F^s_{\tau,q}$, with $1/\tau = 1/p + s/d$ and
$1/q = 1 + s/d$, there is a linear combination of $N$ Gaussians
$s_f(x) := \sum_{j=1}^{N} A_j \exp\big[-\big(\frac{|x - c_j|}{\sigma_j}\big)^2\big]$
so that

$$\|f - s_f\|_p \le C_{p,s,d} N^{-s/d} |f|_{F^s_{\tau,q}}.$$

Proof. Using the coefficients of the wavelet expansion (8), we can express
$s_f$ as

$$s_f := \sum_{I \in \mathcal{D}} f_I T_{N_I} \psi_I,$$

where each term,
$[T_{N_I} \psi_I](x) = \sum_{j=1}^{N_I} a_{I,j} \exp\big[-\big(\frac{|x - c_{I,j}|}{\sigma \ell(I)}\big)^2\big]$,
defined in (7), is composed of $N_I$ Gaussians by the construction preceding
Theorem 4 (note that the notation $c_I$ stands for the $I$-cost, and is very
different from the notation $c_{I,j}$ above). By the enumeration at the end of
Section 3.1 ($\sum_{I \in \mathcal{D}} N_I \le N$), we know that no more than
$N$ Gaussians are used.

From Lemma 8 we have the error estimate

$$\|f - s_f\|_p \le C_{k,d} \Big\| \sum_{I \in \mathcal{D}} \min(1, c_I^{-k/d}) |f_I| \chi_I \Big\|_p.$$
As long as $k$ (which is arbitrary) is greater than $s$, we can estimate the
error as the $L_p$ norm of a series:

$$\|f - s_f\|_p \le C_{k,d} \Big\| \sum_{I \in \mathcal{D}} E_I \Big\|_p,$$

where $E_I(x) := c_I^{-s/d} |f_I| \chi_I(x)$. We now focus on estimating this
series, pointwise. By applying the definition of $c_I$, we obtain (after some
elementary manipulation of exponents),

$$c_I^{-s/d} |f_I| = |f|_{F^s_{\tau,q}}^{\tau s/d}\, m_{s,q,I}^{\tau/p - q}\, |f_I|^q\, |I|^{-qs/d}\, N^{-s/d}.$$

We recall that the partial square-like function is constant on the cube $I$,
where it equals $m_{s,q,I}$. This implies that
$\chi_I(x)\, m_{s,q,I} = \chi_I(x)\, M_{s,q,I} f(x)$, which shows that each
term is

$$E_I(x) = N^{-s/d} |f|_{F^s_{\tau,q}}^{\tau s/d} \big(M_{s,q,I} f(x)\big)^{\tau/p - q} |f_I|^q |I|^{-qs/d} \chi_I(x).$$

The series becomes much more manageable by making some simple substitutions.
Writing the basic summand of the maximal function as
$z_I := |f_I|^q |I|^{-qs/d} \chi_I(x)$, the partial sum of these basic
summands, $Z_I := \sum_{I \subset I'} z_{I'}$, is observed to be the $q$th
power of the partial maximal function, $Z_I = (M_{s,q,I} f(x))^q$, while the
full sum of these, $Z := \sum_{I \in \mathcal{D}} z_I$, is simply the $q$th
power of the (full) maximal function, $Z = (M_{s,q} f(x))^q$. It is a simple
observation that the full series under consideration now has the compact form

$$\sum_{I \in \mathcal{D}} E_I(x) = N^{-s/d} |f|_{F^s_{\tau,q}}^{\tau s/d} \sum_{I \in \mathcal{D}} z_I Z_I^{\frac{\tau}{pq} - 1}.$$

It follows from the inequality
$\sum_{I \in \mathcal{D}} z_I Z_I^{\varepsilon - 1} \le C_\varepsilon Z^\varepsilon$,
valid for nonnegative sequences $(z_I)_{I \in \mathcal{D}}$ and
$0 < \varepsilon$ with constant $C_\varepsilon < \infty$ (this is
[2, Lemma 6.3]), that

$$\sum_{I \in \mathcal{D}} E_I(x) \le C_{p,s,d} N^{-s/d} |f|_{F^s_{\tau,q}}^{\tau s/d} \big((M_{s,q} f(x))^q\big)^{\frac{\tau}{pq}} = C_{p,s,d} N^{-s/d} |f|_{F^s_{\tau,q}}^{\tau s/d} (M_{s,q} f(x))^{\tau/p}.$$

Taking the $L_p$ norm controls the error:

$$\|f - s_f\|_p \le C_{p,s,d} N^{-s/d} |f|_{F^s_{\tau,q}}^{\tau s/d} \Big( \int_{\mathbb{R}^d} (M_{s,q} f(x))^{\tau}\,dx \Big)^{1/p} = C_{p,s,d} N^{-s/d} |f|_{F^s_{\tau,q}}^{\tau s/d + \tau/p} = C_{p,s,d} N^{-s/d} |f|_{F^s_{\tau,q}},$$

since $\tau s/d + \tau/p = 1$. □

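3.3. On Lemma 8. The vector-valued maximal inequality of Fefferman and Stein,
[3, Theorem 1], controls the $L_r(\ell_s)$ norm of the sequence of functions
$(MF_j)_j$ by the $L_r(\ell_s)$ norm of $(F_j)_j$, provided
$1 < r, s < \infty$ (the operator $M$ is the usual Hardy-Littlewood maximal
operator
$MF(x) := \sup_{x \in [a,b]^d} \frac{1}{(b-a)^d} \int_{[a,b]^d} |F(y)|\,dy$):

$$\Big\| \Big( \sum_j |MF_j|^s \Big)^{1/s} \Big\|_r \le C_{r,s} \Big\| \Big( \sum_j |F_j|^s \Big)^{1/s} \Big\|_r.$$

In the lemma we make use of a minor generalization of this for the modified
maximal operator $M_\tau$, defined for $0 < \tau < \infty$ by

$$M_\tau f(x) := \sup_{x \in [a,b]^d} \Big( \frac{1}{(b-a)^d} \int_{[a,b]^d} |f(y)|^{\tau}\,dy \Big)^{1/\tau}.$$

It is not difficult to show that for $\tau < p, q < \infty$

$$\Big\| \Big( \sum_j |M_\tau f_j|^q \Big)^{1/q} \Big\|_p \le C \Big\| \Big( \sum_j |f_j|^q \Big)^{1/q} \Big\|_p. \qquad (14)$$

Indeed, this follows by a direct application of the Fefferman-Stein inequality
with $s = q/\tau$ and $r = p/\tau$ (both greater than one), $F_j = |f_j|^\tau$
and $C = C_{r,s}^{1/\tau}$, because the modified maximal operator is related to
the Hardy-Littlewood maximal operator by $[MF_j]^s = [M_\tau f_j]^q$ and the
$r$ and $p$ norms are related by $\|g\|_r = \big\| |g|^{1/\tau} \big\|_p^{\tau}$.

Proof of Lemma 8. From (13), it follows that

$$R_I \le C_{k,d} \min(1, c_I^{-k/d}) \Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-2k}.$$

Observe that there is a constant, $C_d$, depending only on $d$ so that

$$\Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-d} \le C_d M(\chi_I)(x).$$

We can assume $k > d/2$ without loss of generality. For $\tau$ between $d/2k$
and 1 we have

$$\Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-2k} \le \Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-d/\tau} \le C_d M_\tau(\chi_I)(x),$$

since $M_\tau(\chi_I) = M(\chi_I)^{1/\tau}$. It follows from the modified
Fefferman-Stein inequality (14), that

$$\Big\| \sum_{I \in \mathcal{D}} |f_I| R_I \Big\|_p \le C \Big\| \sum_{I \in \mathcal{D}} \min(1, c_I^{-k/d}) |f_I| M_\tau(\chi_I) \Big\|_p \le C_{k,d} \Big\| \sum_{I \in \mathcal{D}} \min(1, c_I^{-k/d}) |f_I| \chi_I \Big\|_p.$$

□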

4. Nonlinear Approximation in $L_\infty$

Although the basic strategy for nonlinear RBF approximation in $L_\infty$ is,
at heart, the same as in $L_p$, there are some complications that require us to
give it a slightly different treatment. The fundamental difference is that the
Hardy-Littlewood maximal inequality (and, hence, its vector valued analogue,
the Fefferman-Stein inequality, used in the previous section) does not hold for
$p = \infty$. For this reason, we choose to work with a family of smoothness
spaces that do not require us to explicitly work with maximal operators.
Smoothness is measured using a Besov norm, and we use a Besov space based cost
distribution to determine how to distribute the budget.²

Definition 10. For $\tau = d/s \in (0, \infty)$ and $q \in (0, \infty)$, the
Besov space $B^s_{\tau,q}$ is the space of $L_\tau$ functions for which the
(quasi-)seminorm $|f|_{B^s_{\tau,q}}$ is finite, where

$$|f|_{B^s_{\tau,q}} := \Big\| k \mapsto \Big( \sum_{I \in \mathcal{D}_k} |f_I|^{\tau} \Big)^{1/\tau} \Big\|_{\ell_q(\mathbb{Z})}.$$

²The Besov space approach is valid for the case $p < \infty$ that was analysed
in the previous section, too. However, the Triebel-Lizorkin space
$F^s_{\tau,q}$ is slightly larger than the Besov space of the same parameters.
Here, the coefficients $(f_I)$ are as in (8).

Note that for $q < \tau < 1$, $f \in B^s_{\tau,q}$ implies that the wavelet
coefficients $f_I$ are absolutely summable. Since the wavelets $\psi_I$ are
uniformly bounded, this means that the wavelet expansion (8) is absolutely
convergent for $s > d$ and $f \in B^s_{\tau,q}$ (meaning that the main issue
for $L_\infty$ approximation is resolved in this case). For
$1 < \tau < \infty$ and $q < 1$, we also have unconditional convergence of the
wavelet expansion, since, with $1/\tau + 1/\tau' = 1$,

$$\sum_{k} \sum_{I \in \mathcal{D}_k} |f_I|\,|\psi_I(x)| \le \sum_{k} \Big( \sum_{I \in \mathcal{D}_k} |f_I|^{\tau} \Big)^{1/\tau} \Big( \sum_{I \in \mathcal{D}_k} |\psi_I(x)|^{\tau'} \Big)^{1/\tau'}.$$

Because $|\psi_I(x)|^{\tau'} \le C(1 + |x - c(I)|/\ell(I))^{-(d+1)}$ (any
exponent exceeding $d$ would do, since $\psi$ decays rapidly), the second
factor is bounded with a constant depending on $d$ and totally independent of
$k$ and $x$. Thus, the right hand side is controlled by $|f|_{B^s_{\tau,q}}$.
This is a reflection of the fact that $B^s_{\tau,q}$ is embedded in $L_\infty$
for $\tau = d/s$ and $q < \min(\tau, 1)$. Although $L_\infty$ has no
unconditional basis, the Besov space does; the wavelet expansion (8) converges
unconditionally in these cases.

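4.1. Besov Cost Distribution. The approach we take for treating $L_\infty$
error is to alter the strategy for budgeting slightly. As before, for each
wavelet $\psi_I$, we create an approximant $T_{N_I} \psi_I$ using a portion
$N_I$ of the total budget $N$, but the precise distribution of this budget
follows different rules. We rely again on a cost distribution. In this case,
for $I \in \mathcal{D}_j$, it is:

$$c_I := N |f|_{B^s_{\tau,q}}^{-q} A_j^{q - \tau} |f_I|^{\tau} = N \Big( \frac{A_j}{|f|_{B^s_{\tau,q}}} \Big)^{q} \Big( \frac{|f_I|}{A_j} \Big)^{\tau}. \qquad (15)$$

The indices $\tau$ and $q$ are determined by $1/\tau = s/d$ and
$1/q = 1 + s/d$. The quantity $A_j$ is a sort of "energy" of $f$ at the dyadic
level $j$:

$$A_j := \Big( \sum_{I \in \mathcal{D}_j} |f_I|^{\tau} \Big)^{1/\tau}.$$

We do not invest in the wavelet corresponding to $I$ if
$\lfloor c_I \rfloor < N_0$ (the constant from Lemma 3). Thus, we set

$$N_I := \begin{cases} \lfloor c_I \rfloor = \big\lfloor N |f|_{B^s_{\tau,q}}^{-q} A_j^{q - \tau} |f_I|^{\tau} \big\rfloor, & \lfloor c_I \rfloor \ge N_0, \\ 0, & \text{otherwise}. \end{cases} \qquad (16)$$

With this choice at most $N$ Gaussians are used:

$$\sum_{I \in \mathcal{D}} N_I \le \sum_{I \in \mathcal{D}} c_I = N |f|_{B^s_{\tau,q}}^{-q} \sum_{j=-\infty}^{\infty} A_j^{q - \tau} \sum_{I \in \mathcal{D}_j} |f_I|^{\tau} = N |f|_{B^s_{\tau,q}}^{-q} \sum_{j=-\infty}^{\infty} A_j^{q - \tau} A_j^{\tau} = N.$$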
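The Besov cost distribution depends only on the coefficients grouped by dyadic level, so it is easy to tabulate. The following is a minimal sketch of (15) and (16) under the reading of the formulas given above; the function names `besov_costs` and `besov_budgets`, the value of $N_0$ and the random stand-in coefficients are assumptions made for illustration.

```python
import numpy as np

def besov_costs(levels, s, d, N):
    """Costs c_I = N |f|_B^{-q} A_j^{q - tau} |f_I|^tau; levels[j] holds the f_I with I in D_j."""
    tau = d / s
    q = 1.0 / (1.0 + s / d)
    abs_levels = [np.abs(np.asarray(level, dtype=float)) for level in levels]
    # Level energies A_j = (sum over I in D_j of |f_I|^tau)^(1/tau).
    energies = [float(np.sum(a ** tau)) ** (1.0 / tau) for a in abs_levels]
    seminorm_q = sum(A ** q for A in energies)            # |f|_B^q
    costs = []
    for a, A in zip(abs_levels, energies):
        if A > 0.0:
            costs.append(N * A ** (q - tau) * a ** tau / seminorm_q)
        else:
            costs.append(np.zeros_like(a))                # an empty level costs nothing
    return costs

def besov_budgets(costs, N0):
    """N_I = floor(c_I) if floor(c_I) >= N0, and 0 otherwise, as in (16)."""
    out = []
    for c in costs:
        n = np.floor(c).astype(int)
        out.append(np.where(n >= N0, n, 0))
    return out

rng = np.random.default_rng(1)
levels = [rng.standard_normal(2 ** j) * 2.0 ** (-2.0 * j) for j in range(7)]
N = 500
costs = besov_costs(levels, s=0.5, d=1, N=N)
total_cost = sum(float(c.sum()) for c in costs)
total_budget = sum(int(n.sum()) for n in besov_budgets(costs, N0=3))
print(f"sum of costs = {total_cost:.6f} (budget N = {N}); centers used = {total_budget}")
```

Unlike the Triebel-Lizorkin case, the costs here sum to $N$ exactly (up to rounding error), and the rounding in (16) again only lowers the number of centers used.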
4.2. Approximating the Wavelet Expansion. The following lemma is a rough
analogue to Lemma 8, where we show that the investment of centers made in one
"energy level" gives a suitable error.

Lemma 11. For a finitely supported sequence of coefficients
$(a_I)_{I \in \mathcal{D}_j}$, we have the estimate, which holds for $2k > d$:

$$\Big\| \sum_{I \in \mathcal{D}_j} a_I \psi_I - \sum_{I \in \mathcal{D}_j} a_I [T_{N_I} \psi_I] \Big\|_\infty \le C_{k,d} \sup_{I \in \mathcal{D}_j} |a_I| N_I^{-k/d}.$$

Proof. We treat the estimate by considering the error termwise. Theorem 4 gives
the pointwise bound

$$\sum_{I \in \mathcal{D}_j} |a_I|\,\big|\psi_I(x) - T_{N_I} \psi_I(x)\big| \le C_{k,d} \sum_{I \in \mathcal{D}_j} |a_I| N_I^{-k/d} \Big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\Big)^{-2k}.$$

By applying Hölder's inequality, the lemma follows, since for $2k > d$, the
series
$\sum_{I \in \mathcal{D}_j} \big(1 + \frac{\operatorname{dist}(x, I)}{\ell(I)}\big)^{-2k}$
is bounded by a finite constant $C_{k,d}$ that is independent of both $j$ and
$x$. □

We are now in a position to prove our main result for $L_\infty$ approximation.

Theorem 12. Given $s > 0$, there is a constant $C_{s,d}$ so that for
$f \in B^s_{\tau,q}$, with $1/\tau = s/d$ and $1/q = 1 + s/d$, there is a
linear combination of $N$ Gaussians
$s_f(x) := \sum_{j=1}^{N} A_j \exp\big[-\big(\frac{|x - c_j|}{\sigma_j}\big)^2\big]$
so that

$$\|f - s_f\|_\infty \le C_{s,d} N^{-s/d} |f|_{B^s_{\tau,q}}.$$

Proof. Using the budget (16), the approximant is

$$s_f := \sum_{I \in \mathcal{D}} f_I T_{N_I} \psi_I,$$

where each term, $T_{N_I} \psi_I$, is composed of $N_I$ Gaussians as in
Theorem 4.

We estimate $\|f - s_f\|_\infty$, recalling the unconditional convergence of
the wavelet expansion for functions coming from the Besov space for this choice
of $\tau$ and $q$.

$$\|f - s_f\|_\infty \le \sum_{j=-\infty}^{\infty} \Big\| \sum_{I \in \mathcal{D}_j} f_I (\psi_I - T_{N_I} \psi_I) \Big\|_\infty \le C(k,d) \sum_{j=-\infty}^{\infty} \big\| I \mapsto N_I^{-k/d} |f_I| \big\|_{\ell_\infty(\mathcal{D}_j)} \le C(k,d) \sum_{j=-\infty}^{\infty} \big\| I \mapsto c_I^{-s/d} |f_I| \big\|_{\ell_\infty(\mathcal{D}_j)}.$$

The first inequality is simply the triangle inequality, since the sums
considered are all finite, while the second is Lemma 11. The final inequality
holds for $s < k$, because $c_I \le 1 + N_I \le C N_I$.

By invoking the definition of $c_I$, and by manipulating exponents
(specifically, using the facts that $\tau s/d = 1$ and that
$s/d + 1 = 1/q$) we arrive at

$$\|f - s_f\|_\infty \le C(k,d) N^{-s/d} |f|_{B^s_{\tau,q}}^{qs/d} \sum_{j=-\infty}^{\infty} A_j^{(\tau - q)s/d} \big\| I \mapsto |f_I|^{1 - \tau s/d} \big\|_{\ell_\infty(\mathcal{D}_j)}$$

$$= C(k,d) N^{-s/d} |f|_{B^s_{\tau,q}}^{qs/d} \sum_{j=-\infty}^{\infty} A_j^{(\tau - q)s/d} = C(k,d) N^{-s/d} |f|_{B^s_{\tau,q}}^{qs/d} \sum_{j=-\infty}^{\infty} A_j^{q} = C(k,d) N^{-s/d} |f|_{B^s_{\tau,q}}.$$

□